Detecting Credit Card Default with Machine Learning

Download the dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls

Below is a simplified description of the variables:

limit_bal: The amount of the given credit (NT dollars)
sex: Gender
education: Level of education
marriage: Marital status
age: Age of the customer
payment_status_(month): Status of payment in one of the previous 6 months
bill_statement_(month): The amount of the bill statement (NT dollars) in one of the previous 6 months
previous_payment_(month): The amount of the previous payment (NT dollars) in one of the previous 6 months

The target variable ['default_payment_next_month'] indicates whether the customer defaulted on the payment in the following month.

Loading the dataset and managing the data types

1.1.1 Import the libraries:

1.1.2 Separate the features from the target (y):

1.1.3 Inspecting the data types:

1.1.4 Memory optimization:

1.1.5 Convert object columns to categorical:
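The steps above can be sketched as follows. The real notebook reads the downloaded .xls file (with `header=1`, since the file has a title row); here a tiny synthetic frame stands in for the data so the snippet is self-contained, and the column names follow the variable description above:

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the UCI data; in the notebook the frame comes from
# pd.read_excel("default of credit card clients.xls", header=1) after downloading.
df = pd.DataFrame({
    "limit_bal": np.array([20000, 120000, 90000], dtype="int64"),
    "sex": ["Male", "Female", "Female"],
    "education": ["University", "High school", "University"],
    "age": np.array([24, 26, 34], dtype="int64"),
    "default_payment_next_month": [1, 0, 0],
})

# 1.1.2 Separate the features from the target
X = df.drop(columns=["default_payment_next_month"])
y = df["default_payment_next_month"]

# 1.1.3 Inspect the data types and memory usage
X.info(memory_usage="deep")

# 1.1.4 Memory optimization: downcast the integer columns
for col in X.select_dtypes(include="int64").columns:
    X[col] = pd.to_numeric(X[col], downcast="integer")

# 1.1.5 Convert object columns to the memory-efficient category dtype
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category")
```

After the conversion, `X.info(memory_usage="deep")` reports a much smaller footprint, which matters on the full 30,000-row dataset.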

EDA (Exploratory Data Analysis)

1.2.1 Summary statistics for the numerical variables:

1.2.2 Summary statistics for the categorical variables:

1.2.3 Age distribution, overall and by gender:

Comment:

We noticed that spikes appear roughly every 10 years, which is caused by the binning. Below we created the same histogram with sns.countplot and plotly_express. This way each age value gets its own bar, and we can examine the chart in detail. The following charts show no such spikes:

1.2.4 Age distribution, overall and by gender (histogram with sns.countplot and plotly_express):

We can separate the genders by specifying the hue argument:
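A minimal countplot sketch on synthetic ages (the frame and value ranges are illustrative); the hue argument splits each age bar by gender:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the customer data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(21, 60, size=500),
    "sex": rng.choice(["Male", "Female"], size=500),
})

# One bar per distinct age value; hue="sex" draws separate bars per gender
fig, ax = plt.subplots(figsize=(12, 4))
sns.countplot(data=df, x="age", hue="sex", ax=ax)
ax.set_title("Age distribution by gender")
```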

1.2.6 Define a function to plot the correlation heatmap:

Comment:

The plot shows that age is not correlated with any other feature.

We can check the correlation between the (numerical) features and the target variable:
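One possible implementation of such a helper, demonstrated on random data (the function name and the demo frame are illustrative, not taken from the notebook):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def plot_correlation_heatmap(df):
    """Plot a heatmap of the pairwise correlations of the numeric columns."""
    corr = df.select_dtypes(include="number").corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
    return corr

# Random demo data standing in for the credit dataset
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "limit_bal": rng.normal(size=200),
    "age": rng.normal(size=200),
    "default_payment_next_month": rng.integers(0, 2, size=200),
})
corr = plot_correlation_heatmap(demo)

# Correlation of every numeric feature with the target
print(corr["default_payment_next_month"].sort_values(ascending=False))
```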

1.2.7 Plot the distribution of the limit balance for each gender and education level:

Comment:

After some investigation, a few patterns were found:

1.2.8 Examine the distribution of the target variable by gender and education level:

Comment:

- The analysis shows that the highest percentage of defaults occurs among male customers.

1.2.9 Examine the percentage of defaults by education level:

Comment:

- Based on the analysis, defaults occur most frequently among customers with a "High school" education and least frequently among customers in the "Others" category.

Splitting the data into a training and a test set

1.3.1 Import the function from sklearn:

1.3.2 Split the data into training and test sets:

1.3.3 Split the data into training and test sets without shuffling:

1.3.4 Split the data into training and test sets with stratification:

Comment:

20% test set, 80% training set

"stratify=y" - a stratified random split that preserves the class ratio, which matters for imbalanced data

1.3.5 Verify that the ratio of the target is preserved:

Comment: in both samples, the default rate is about 22.12%.
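The stratified 80/20 split and the check from step 1.3.5 can be sketched as follows; a synthetic frame with roughly the same default rate stands in for the real data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~22% positives) standing in for the credit data
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "limit_bal": rng.normal(size=1000),
    "age": rng.integers(21, 70, size=1000),
})
y = pd.Series((rng.random(1000) < 0.22).astype(int))

# 80/20 split; stratify=y keeps the class ratio (nearly) identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Verify that the ratio of the target is preserved
print(f"train default rate: {y_train.mean():.4f}")
print(f"test default rate:  {y_test.mean():.4f}")
```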

Splitting the data into training, validation, and test sets for evaluating the model and tuning its hyperparameters:
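One common way to obtain all three sets is two consecutive stratified calls to train_test_split; the 60/20/20 proportions below are an illustrative choice:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data
rng = np.random.default_rng(0)
X = pd.DataFrame({"limit_bal": rng.normal(size=1000)})
y = pd.Series((rng.random(1000) < 0.22).astype(int))

# First split off the test set (20%), then carve the validation set
# out of the remainder (0.25 * 0.8 = 0.2 of the full data)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_valid), len(X_test))  # 600 200 200
```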

Handling missing values

1.5.1 Import the libraries:

1.5.2 Inspect the information about the DataFrame:

1.5.3 Visualize the nullity of the DataFrame:

Comment: The white bars in the matrix indicate missing values; they show which of the 23 columns contain missing values and in which rows they occur.

1.5.4 Define columns with missing values per data type:

1.5.5 Impute the numerical feature:

Use SimpleImputer(strategy='median') to fill the missing values in the ['age'] column.

1.5.6 Impute the categorical features:

Use SimpleImputer(strategy='most_frequent') to fill the missing values in the ['sex', 'education', 'marriage'] columns.

1.5.7 Verify that there are no missing values:

Encoding categorical variables

1.6.1. Import the libraries:

  1. Select categorical features for one-hot encoding:
  1. Instantiate the One-Hot Encoder object:
  1. Create the column transformer using the one-hot encoder:
  1. Fit the transformer:
  1. Apply the transformations to both training and test sets:
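A sketch of those five steps on a toy frame; the encoder settings (handle_unknown="ignore", keeping the numeric columns via remainder="passthrough") are one reasonable choice, not necessarily the notebook's:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy training and test frames standing in for the real splits
X_train = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "education": ["University", "High school", "University", "Others"],
    "age": [24, 26, 34, 40],
})
X_test = pd.DataFrame({"sex": ["Female"], "education": ["University"], "age": [29]})

# 1. Categorical features to one-hot encode
cat_features = ["sex", "education"]

# 2. Encoder; unseen test categories encode as all zeros instead of raising
one_hot = OneHotEncoder(handle_unknown="ignore")

# 3. Column transformer: encode the categoricals, pass numeric columns through
preprocessor = ColumnTransformer(
    transformers=[("onehot", one_hot, cat_features)],
    remainder="passthrough",
)

# 4. Fit on the training set only
preprocessor.fit(X_train)

# 5. Apply the transformations to both sets
X_train_enc = preprocessor.transform(X_train)
X_test_enc = preprocessor.transform(X_test)
print(X_train_enc.shape, X_test_enc.shape)  # (4, 6) (1, 6)
```

Fitting on the training set alone avoids leaking information from the test set into the encoding.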

Fitting a Decision Tree classifier

  1. Import the libraries:
  1. Create the instance of the model, fit it to the training data and create prediction:
  1. Evaluate the results:
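The three steps can be sketched on synthetic data; make_classification stands in for the preprocessed credit data, with a similar class imbalance:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, roughly 22% positive class like the credit data
X, y = make_classification(n_samples=1000, weights=[0.78], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Create the model instance, fit it to the training data, and predict
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Evaluate the results
print(classification_report(y_test, y_pred))
print(f"recall: {recall_score(y_test, y_pred):.3f}")
```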

Evaluating our algorithm and model

  1. Plot the simplified Decision Tree:

Evaluating the precision-recall curve
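A sketch of computing and plotting the precision-recall curve for a fitted tree; the curve is driven by the predicted probability of the positive class, and the synthetic data is again a stand-in:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.78], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
y_proba = tree.predict_proba(X_test)[:, 1]  # probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
pr_auc = auc(recall, precision)  # area under the PR curve

fig, ax = plt.subplots()
ax.plot(recall, precision)
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_title(f"Precision-Recall curve (AUC = {pr_auc:.3f})")
```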

Implementing scikit-learn's Pipeline

  1. Import the libraries:
  1. Load the data, separate the target and create the stratified train-test split:
  1. Store lists of numerical/categorical features:
  1. Define the numerical pipeline:
  1. Define the categorical pipeline:
  1. Define the column transformer object:
  1. Create the joint pipeline:
  1. Fit the pipeline to the data:
  1. Evaluate the performance of the entire pipeline:
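Putting the nine steps together on a small synthetic frame; the structure mirrors the text, while the exact column names and the toy data are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Small synthetic frame standing in for the credit data
df = pd.DataFrame({
    "limit_bal": [20000, 120000, np.nan, 90000] * 25,
    "sex": ["Male", "Female", None, "Female"] * 25,
    "age": [24, 26, 34, np.nan] * 25,
    "default_payment_next_month": [1, 0, 0, 0] * 25,
})
X = df.drop(columns=["default_payment_next_month"])
y = df["default_payment_next_month"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Lists of numerical/categorical features
num_features = X_train.select_dtypes(include="number").columns.to_list()
cat_features = X_train.select_dtypes(include="object").columns.to_list()

# Numerical and categorical sub-pipelines
num_pipeline = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
cat_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Column transformer routing each column list to its sub-pipeline
preprocessor = ColumnTransformer(transformers=[
    ("numerical", num_pipeline, num_features),
    ("categorical", cat_pipeline, cat_features),
])

# Joint pipeline: preprocessing followed by the classifier
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"test accuracy: {score:.3f}")
```

Because preprocessing and model live in one object, the whole chain is fitted on the training data only and applied consistently to the test data.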

Evaluating our algorithm and model


Tuning hyperparameters using grid search and cross-validation

  1. Import the libraries:
  1. Define the cross-validation scheme:
  1. Evaluate the pipeline using cross-validation:
  1. Add extra metrics to cross-validation:
  1. Define the parameter grid:
  1. Run Grid Search:
  1. Evaluate the performance of the Grid Search:

Evaluating our algorithm and model

  1. Run Randomized Grid Search:
  1. Evaluate the performance of the Randomized Grid Search:
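RandomizedSearchCV samples a fixed number of hyperparameter sets from the grid instead of trying every combination; a sketch with illustrative values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, weights=[0.78], random_state=42)

param_distributions = {
    "max_depth": [3, 5, 8, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5, 10, 20],
}

rand_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,  # number of sampled hyperparameter sets
    scoring="recall",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=-1,
)
rand_search.fit(X, y)
print("best params:", rand_search.best_params_)
print(f"best recall: {rand_search.best_score_:.3f}")
```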

Evaluating our algorithm and model

Getting Ready

Prepare the pipelines for the more advanced classifiers (RandomForestClassifier, GradientBoostingClassifier).

How to do it...

  1. Import the libraries:
  1. Create a Random Forest Pipeline:

```python
num_features = X_train.select_dtypes(include='number').columns.to_list()
cat_features = X_train.select_dtypes(include='object').columns.to_list()

num_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

cat_list = [list(X_train[column].dropna().unique()) for column in cat_features]

cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(categories=cat_list, sparse=False,
                             handle_unknown='error', drop='first'))
])

preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_pipeline, num_features),
    ('categorical', cat_pipeline, cat_features)
], remainder='drop')
```

3. Create a Gradient Boosting Trees Pipeline:
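The Gradient Boosting pipeline mirrors the Random Forest one: only the final estimator changes. A reduced, runnable sketch on synthetic data (the preprocessing is shrunk to a numeric imputer here so the snippet is self-contained):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, weights=[0.78], random_state=42)

gbt_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # stands in for the full preprocessor
    ("classifier", GradientBoostingClassifier(random_state=42)),
])
gbt_pipeline.fit(X, y)
print(f"training accuracy: {gbt_pipeline.score(X, y):.3f}")
```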

There's more...

Below we go over the most important hyperparameters of the considered models and show a possible way of tuning them using Randomized Search. With more complex models, the training time is significantly longer than with the basic Decision Tree, so we need to find a balance between the time we want to spend on tuning the hyperparameters and the expected results. Also, bear in mind that changing the values of some parameters (such as learning rate or the number of estimators) can itself influence the training time of the models.

To have the results in a reasonable amount of time, we used Randomized Search with 100 different sets of hyperparameters for each model (the number of actually fitted models is higher due to cross-validation). Just as in the recipe Grid Search and Cross-Validation, we used recall as the criterion for selecting the best model. Additionally, we used the scikit-learn-compatible APIs of XGBoost and LightGBM to make the process as easy to follow as possible. For a complete list of hyperparameters and their meaning, please refer to the corresponding documentation.

Random Forest

When tuning the Random Forest classifier, we look at the following hyperparameters (there are more available for tuning):

* n_estimators - the number of decision trees in the forest.

* max_features - the maximum number of features considered when splitting a node. The default is the square root of the number of features; when None, all features are considered.

* max_depth - the maximum number of levels in each decision tree.

* min_samples_split - the minimum number of observations required to split a node. Setting it too high may cause underfitting, as the trees will not split enough times.

* min_samples_leaf - the minimum number of data points allowed in a leaf. Too small a value might cause overfitting, while large values might prevent the tree from growing and cause underfitting.

* bootstrap - whether to use bootstrapping for each tree in the forest.

We define a grid below:

And use the randomized search to tune the classifier:
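A sketch of both steps on synthetic data; the grid values are illustrative, and n_iter is reduced from the 100 used in the text to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, weights=[0.78], random_state=42)

# Grid covering the hyperparameters discussed above
rf_param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", None],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "bootstrap": [True, False],
}

rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=rf_param_grid,
    n_iter=10,  # 100 in the text; reduced here to keep the sketch fast
    scoring="recall",
    cv=3,
    random_state=42,
    n_jobs=-1,
)
rf_search.fit(X, y)
print("best params:", rf_search.best_params_)
```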

Gradient Boosted Trees

As Gradient Boosted Trees are also an ensemble method built on top of decision trees, many of the hyperparameters are the same as for the Random Forest. The new one is the learning rate, which is used in the gradient descent algorithm to control the rate of descent towards the minimum of the loss function. When tuning the model manually, we should consider this hyperparameter together with the number of estimators: reducing the learning rate (slower learning) while increasing the number of estimators can increase the computation time significantly.

We define the grid as follows:

And run the randomized search:
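Analogously to the Random Forest case, with learning_rate added to the grid (values illustrative, n_iter again reduced):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, weights=[0.78], random_state=42)

gbt_param_grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],  # tune together with n_estimators
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 5],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

gbt_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=gbt_param_grid,
    n_iter=10,  # 100 in the text; reduced to keep the sketch fast
    scoring="recall",
    cv=3,
    random_state=42,
    n_jobs=-1,
)
gbt_search.fit(X, y)
print("best params:", gbt_search.best_params_)
```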

Import the libraries:

  1. Create an XGBoost Pipeline:
  1. Create a LightGBM classifier Pipeline:

XGBoost

The scikit-learn API of XGBoost ensures that the hyperparameters are named similarly to their equivalents in other scikit-learn classifiers. For example, XGBoost's native eta hyperparameter is called learning_rate in the scikit-learn API.

The new hyperparameters we consider for this example are:

We define the grid as:

When defining ranges for parameters that are bounded (such as colsample_bytree, which cannot exceed 1.0), it is better to use np.linspace rather than np.arange, because the latter can produce inconsistent values when the step is a floating-point number. For example, the last value might be 1.0000000002, which then causes an error while training the classifier.

LightGBM

We tune the same parameters as in XGBoost, though tuning more of them is definitely possible and encouraged. The grid is defined as follows:
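A grid of the same shape for LightGBM's scikit-learn API (LGBMClassifier); the value ranges are illustrative, and the dictionary is passed to RandomizedSearchCV exactly as before:

```python
import numpy as np

lgbm_param_grid = {
    "learning_rate": np.linspace(0.01, 0.3, 10),
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [3, 5, 8, -1],                 # -1 means no depth limit in LightGBM
    "colsample_bytree": np.linspace(0.3, 1.0, 8),
}
```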

Below we present a summary of all the classifiers we have considered in the last 3 recipes.